class: center, middle, hide-logo

<style type="text/css">
pre {
  background: #F8F8F8;
  max-width: 100%;
  overflow-x: scroll;
}
</style>

<style type="text/css">
.scroll-output {
  height: 80%;
  overflow-y: scroll;
}
</style>

# First Machine Learning Workshop

## by <img src="GraphicsSlides/Logo RUG hell.png" width="50%" />

##### Author/Presenter: Ruben Ernst/Mathias Steilen

##### Last updated: _2022-11-24 11:47:23_

---

### Today's Mission

<br>
<br>

.center[
<img src="GraphicsSlides/get in loser.png" width="60%" />
]

Courtesy of the TidyTuesday project - Check it out!

---

# Background

> The Great Pumpkin Commonwealth's (GPC) mission cultivates the hobby of growing giant pumpkins throughout the world by establishing standards and regulations that ensure quality of fruit, fairness of competition, recognition of achievement, fellowship and education for all participating growers and weigh-off sites.

_[Link to Website](https://gpc1.org/)_

.center[
<img src="GraphicsSlides/GPCJoinPagecoop.jpg" width="50%" />
]

---

# Let's look at the files

.panelset[
.panel[.panel-name[training]

```r
training <- read_csv("./Data/training.csv")
```

```
## Rows: 8745 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): type, grower_name, city, state_prov, country, gpc_site
## dbl (6): year, place, weight_kg, ott, est_weight, id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

This file will be used for training/fitting your model.
]

.panel[.panel-name[testing]

```r
holdout <- read_csv("./Data/holdout.csv")
```

```
## Rows: 2187 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): type, grower_name, city, state_prov, country, gpc_site
## dbl (5): year, place, ott, est_weight, id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

You will make predictions on this file. It contains no target variable, so there won't be data leakage during training. Before submitting your predictions, make sure they follow the sample submission format.
]

.panel[.panel-name[sample_submission]

```r
sample_submission <- read_csv("./Data/sample_submission.csv")
```

```
## Rows: 2187 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): id, weight_kg
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

**Important**: Your submission to our email address must adhere to this format (CSV file).
]
]

---

# Our Basic Example

**Disclaimer**

--

.pull-left[
Some of you might feel like this:

.center[
<img src="GraphicsSlides/sad-cry.gif" width="75%" />
]
]

--

.pull-right[
And some of you might feel like this:

<img src="GraphicsSlides/kanye-west-bored-gif.webp" width="100%" />
]

--

The learning curve is always steep when looking at it from the bottom. Use the time later to ask your more experienced peers (or us) questions.
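---

#### Our Basic Example: Setup

The code on the following slides calls functions such as `read_csv()`, `initial_split()` and `linear_reg()` without showing where they come from. Here is a minimal setup sketch, assuming the tidyverse and tidymodels packages are installed. The `glmnet` package is additionally needed for the engine used later. The seed value is an arbitrary choice for illustration and is not taken from the original run.

```r
# Assumed setup: the slides do not show which packages were loaded.
library(tidyverse)   # read_csv(), %>%, select(), rename(), ggplot()
library(tidymodels)  # initial_split(), recipe(), workflow(), tune_grid(), rsq()

# Hypothetical seed so that random splits and folds are repeatable between
# runs; results will not necessarily match the numbers shown in the slides.
set.seed(2022)
```

Setting a seed before `initial_split()` and `vfold_cv()` keeps the split and the folds the same each time the script is re-run, which makes tuning results easier to compare.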
---

#### Our Basic Example: The ol' reliable

First, make splits from the training data:

```r
dt_split <- initial_split(training)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 5) # resampling for tuning
```

```r
dt_split
```

```
## <Training/Testing/Total>
## <6558/2187/8745>
```

---

#### Our Basic Example: The ol' reliable

Let's specify a basic linear regression with a penalty. Because both `penalty` and `mixture` are tuned, this is an elastic net.

```r
lin_spec <- linear_reg(mixture = tune(), penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```

```r
lin_rec <- recipe(weight_kg ~ year + place + ott + est_weight + country,
                  data = training) %>%
  step_impute_mean(all_numeric_predictors()) %>%  # fill missing numerics with the mean
  step_novel(all_nominal_predictors()) %>%        # allow levels unseen during training
  step_unknown(all_nominal_predictors(), new_level = "not specified") %>%  # label missing categories
  step_other(country, threshold = 0.03) %>%       # pool countries below 3% into "other"
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%  # one-hot encode
  step_rm(country_other)                          # drop the pooled "other" column
```

---

#### Our Basic Example: The ol' reliable

```r
lin_rec %>% prep() %>% juice()
```

```
## # A tibble: 8,745 × 11
##     year place   ott est_weight weight…¹ count…² count…³ count…⁴ count…⁵ count…⁶
##    <dbl> <dbl> <dbl>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2013   356  940.       241.     453.       0       0       0       0       0
##  2  2013   356  904.       451.     453.       0       0       0       0       0
##  3  2013   358  917.       241.     453.       0       0       0       0       0
##  4  2013   359  892.       434.     453.       0       0       0       0       0
##  5  2013   361  917.       241.     453.       0       1       0       0       0
##  6  2013   363  706.       241.     452.       0       0       0       0       0
##  7  2013   363  874.       408.     452.       0       0       0       0       0
##  8  2013   363  963.       241.     452.       0       1       0       0       0
##  9  2013   366  879.       416.     451.       0       1       0       0       0
## 10  2013   368  706.       241.     451.       0       0       0       0       0
## # … with 8,735 more rows, 1 more variable: country_United.States <dbl>, and
## #   abbreviated variable names ¹weight_kg, ²country_Austria, ³country_Canada,
## #   ⁴country_Germany, ⁵country_Italy, ⁶country_Japan
```

---

#### Our Basic Example: The ol' reliable

.scroll-output[

```r
lin_wf <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec)

lin_wf
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_impute_mean()
## • step_novel()
## • step_unknown()
## • step_other()
## • step_dummy()
## • step_rm()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
##
## Computational engine: glmnet
```

]

---

#### Our Basic Example: The ol' reliable

Let's tune the penalty and the mixture:

```r
lin_tune_results <- tune_grid(
  lin_wf,
  resamples = folds,
  grid = grid_regular(penalty(), mixture(), levels = 10)
)
```

---

#### Our Basic Example: The ol' reliable

.scroll-output[

Let's look at the results:

```r
lin_tune_results %>%
  show_best(metric = "rsq")
```

```
## # A tibble: 5 × 8
##         penalty mixture .metric .estimator  mean     n std_err .config
##           <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1 0.0000000001    0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 2 0.00000000129   0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 3 0.0000000167    0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 4 0.000000215     0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
## 5 0.00000278      0.111 rsq     standard   0.870     5 0.00374 Preprocessor1_Mo…
```

]

---

#### Our Basic Example: The ol' reliable

Let's finalise the model with the best parameters from tuning:

```r
lin_fit <- lin_wf %>%
  finalize_workflow(select_best(lin_tune_results, metric = "rsq")) %>%
  fit(dt_train)
```

This fits the finalised workflow on the training split.

---

#### Our Basic Example: The ol' reliable

```r
lin_fit %>%
  predict(dt_test)
```

```
## # A tibble: 2,187 × 1
##    .pred
##    <dbl>
##  1  424.
##  2  488.
##  3  423.
##  4  424.
##  5  494.
##  6  441.
##  7  439.
##  8  439.
##  9  307.
## 10  484.
## # … with 2,177 more rows
```

---

#### Our Basic Example: The ol' reliable

```r
lin_fit %>%
  augment(dt_test) %>%
  rsq(truth = weight_kg, estimate = .pred)
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.860
```

---

#### Our Basic Example: The ol' reliable

Happy with that? Fit it on the entire training data provided and then make predictions for your final submission:

```r
final_model <- lin_wf %>%
  finalize_workflow(select_best(lin_tune_results, metric = "rsq")) %>%
* fit(training)
```

---

#### Our Basic Example: The ol' reliable

.scroll-output[

Make predictions and save the results as a _.csv_ file, for example with `write_csv()`. Then submit your predictions to us and we will score them. You can submit as often as you like. Since we have the target values, we'll tell you how you performed on the holdout.

```r
final_model %>%
  augment(holdout) %>%
  select(id, .pred) %>%
  rename(weight_kg = .pred)
```

```
## # A tibble: 2,187 × 2
##       id weight_kg
##    <dbl>     <dbl>
##  1 10194      491.
##  2  9228      451.
##  3 10361      487.
##  4 10033      469.
##  5 10369      490.
##  6  8875      489.
##  7 10203      487.
##  8  9466      441.
##  9  9850      483.
## 10  9931      439.
## # … with 2,177 more rows
```

]

---

#### Our Basic Example: The ol' reliable

.scroll-output[

```r
final_model %>%
  augment(read_csv("./Data/holdout_with_target.csv", show_col_types = FALSE)) %>%
  ggplot(aes(weight_kg, .pred)) +
  geom_point(alpha = 0.2) +
  geom_abline(lty = "dashed", colour = "red")
```

<img src="Giant-Pumpkins_files/figure-html/unnamed-chunk-24-1.png" width="100%" />

]

---

### You won't have the target variable on the holdout data set for two reasons

--

.pull-left[
.center[
**Reason 1:**
<br>
<img src="GraphicsSlides/roll safe.png" width="100%" />
]
]

--

.pull-right[
.center[
**Reason 2:**
<br>
<img src="GraphicsSlides/pumpkin spice.jpg" width="60%" />
]
]

There's something to win here - so play fair.

---

### Off you go

Have a look at our example for dealing with splits and hyperparameter tuning in the Tidymodels tutoring session. We'll be here to answer your questions once you get to it.

Copying and pasting the code in the slides is **allowed**!

Spend as long as you need modelling.

#### 🕒 20:00

---

# That's it for today!

After our session: Watch videos from Julia Silge, Andrew Couch and David Robinson.

Most importantly, have fun while learning.

For further questions, feel free to reach out to us.

Make sure to stay updated on our socials and via our website, where all resources and dates are also published.

<br>

.center[
<img src="GraphicsSlides/Logo RUG hell.png" width="60%" />

**[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)**
]

---
class: middle, inverse, hide-logo

# Thank you for attending!
The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.